Data Science for Public Policy

Final Project

Author

Olivia Gomez, Alicia Helfrich, Allie Gleich, Edward Malcolm

Final Project

The Importance of Health Care Access and Economic Stability for Immigrant Communities in the United States

Immigration is deeply embedded into the history of the United States. For centuries, the arrival of individuals born abroad has brought new ideas, cultures, and economic/social benefits to the nation. Despite this, there is evidence that many of these individuals experience additional barriers to accessing the economic and social opportunities that exist, and experience high economic and health disparities as compared to those born in the United States.

Research delving into the historical health and economic disparities that face immigrant communities has focused on limited access to health insurance coverage and negative economic outcomes related to poverty, like food insecurity and housing instability (Chang, 2019). According to Derose et. al., immigrants “have lower rates of health insurance, use less health care, and receive lower quality of care than U.S.-born populations” (2007). “Foreign-born residents of America’s suburbs experienced markedly higher poverty rates (14.1 percent) than the U.S. born (9.8 percent) in 2009,” according to Suro et. al. (2011). Two factors that contribute to physical well being is access to health insurance and access to health centers. Health insurance facilitates access to care and is associated with lower death rates, better health outcomes, and improved productivity. U.S. residents obtain health coverage from a variety of private and public sources, such as through their employers or direct purchase on the individual market (private sources), as well as through the Medicare, Medicaid, or Veterans Affairs programs (public sources) (American Hospital Association).

Additionally, the US immigrant population is not monolithic, and it is important to explore factors that can influence an individuals health or economic status such as documentation status or how many years they have spent in the US. The important of these factors has been illustrated by prior research: A May 2022 study by the Guttmacher Institute, used the 2019 American Community Survey and the 2017−2018 National Health Interview Survey to conclude that there exist disparities by immigration status in access to health insurance, as well as primary and reproductive health care. In another study that explored health insurance coverage of U.S. immigrants, Dr. Ku analyzed the differences between established immigrants versus recent immigrants (Ku, 2011). Recent immigrants were categorized as immigrants who had been in the U.S. for less than a decade, while established immigrants were categorized as immigrants who had been in the U.S. for over a decade. When comparing the health insurance status of these two groups, Ku found that only 44% of recent immigrants were fully covered by health insurance, where established immigrants were at 63%. These results reveal the impact time in America has on the insurance coverage of immigrants.

Research Question

Through exploratory data analysis, supervised and unsupervised machine learning, and geospatial analysis, this research project strives to explore factors that contribute to the health and economic status of immigrant communities. Specifically, we sought to understand health care access in immigrant communities, understanding how the year of arrival influences ones income levels, and exploring differentiating factors that could influence health/economic status outside of only being born in the United States. With an increased understanding of these factors, policy makers are more effectively address disparities that exist, as well as identify protective factors that support immigrants thriving within their communities.

Sources

Chang, C. D. (2019). Social determinants of health and health disparities among immigrants and their children. Current problems in pediatric and adolescent health care, 49(1), 23-30.

Derose, K. P., Escarce, J. J., & Lurie, N. (2007). Immigrants and health care: sources of vulnerability. Health affairs, 26(5), 1258-1268.

Fuentes, L., Desai, S., and Dawson, R.. (2022). “New Analyses on US Immigrant Health Care Access Underscore the Need to Eliminate Discriminatory Policies.” Www.guttmacher.org, May. https://doi.org/10.1363/2022.33551.

Ku, Leighton. (2011). Health Insurance Coverage and Medical Expenditures of Immigrants and Native-Born Citizens in the United States.American Journal Of Public Health, 99(7), 1322-1328.

Pourat, N., Wallace, S. P., Hadler, M. W., & Ponce, N. (2014). Assessing health care services used by California’s undocumented immigrant population in 2010. Health Affairs, 33(5), 840-847

Schumacher, S., and Presiado, M. (2023). “Health and Health Care Experiences of Immigrants: The 2023 KFF/LA Times Survey of Immigrants.” KFF. September 17, 2023. https://www.kff.org/racial-equity-and-health-policy/issue-brief/health-and-health-care-experiences-of-immigrants-the-2023-kff-la-times-survey-of-immigrants/.

Suro, R., Wilson, J. H., & Singer, A. (2011). Immigration and poverty in America’s suburbs. Metropolitan Opportunity Series, 1-21.

2. Data Sources

Library

library(ipumsr)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(VGAM)
library(rpart)
library(rpart.plot)
library(hardhat)
library(ranger)
library(recipes)
library(srvyr)
library(haven)
library(survey)
library(tidyclust)
library(GGally)
library(patchwork)
library(dotenv)
library(httr)
library(jsonlite)
library(tidycensus)
library(tigris)
library(lubridate)
library(patchwork)
library(stringr)
library(dplyr)
library(themis)
library(parsnip)
library(rsample)
library(yardstick)
library(vip)
library(kknn)
library(caret)
library(httr)
library(jsonlite) 
library(lubridate)
library(sf)
library(readr)
library(stringr)
library(janitor)
library(recipes)
library(RColorBrewer)

set.seed(20231215)

Current Population Survey - 2022

We are utilizing the Current Population Survey from 2022 as a major source of data for our analysis to explore factors that would contribute to an immigrant’s income or health status. This survey is conducted each year to capture statistical information on the population. The units of measurement are individual adults. The survey contains information on immigration demographic characteristics in addition to health insurance status. Below, we will utilize unsupervised machine learning techniques to analyze the relationship between health and demographic characteristics and estimate a model, given that the survey has many correlated predictors. We will also use the data in a supervised machine learning capacity to predict health coverage status based on several demographic predictors.

We used the IPUMS API to download the 2022 CPS data. We read in the data using the ipumsr package. To read in the data, an IPUMS API key is required and read in using the dotevn package.

#using IPUMS API to access CPS ASEC 2022 data

dotenv::load_dot_env(file = ".env")

# Access the values
ipums_key <- Sys.getenv("ipums_api_key")

cps_extract_request <- define_extract_cps(
  description = "IPUMS-CPS, ASEC 2022",
  samples = c("cps2022_03s"),
  variables = c("CPSID", "ASECFLAG", "ASECWTH", "PERNUM", 
  "CPSIDV", "CPSIDP", "ASECWT", "AGE", 
  "SEX", "RACE", "MARST", "POPSTAT", "ASIAN", 
  "FAMSIZE", "NCHILD", "NCHLT5", "BPL", "YRIMMIG", 
  "CITIZEN", "MBPL", "FBPL", "NATIVITY", "HISPAN", 
  "EMPSTAT", "LABFORCE", "EDUC", "FTOTVAL", "INCTOT", 
  "INCWAGE", "OFFPOV", "MIGSTA1", "MIGRATE1", 
  "HIMCAIDNW", "HIMCARENW", "ANYCOVNW", "ANYCOVLY", "HHINCOME", "PRVTCOVNW", 
  "GRPCOVNW", "DPCOVNW", "MRKCOVNW", "TRCCOVNW", "INHCOVNW")
)

submitted_extract <- submit_extract(extract = cps_extract_request, api_key = Sys.getenv("ipums_api_key"))
downloadable_extract <- wait_for_extract(extract = submitted_extract, api_key = Sys.getenv("ipums_api_key"))
data_files <- download_extract(downloadable_extract, api_key = Sys.getenv("ipums_api_key"))

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |======================================================================| 100%

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |======================================================================| 100%
file_number <- downloadable_extract[["number"]]

ddi_file <- paste0("cps_000", file_number, ".xml")

#Loading in the CPS 2022 data from IPUMS API
ddi <- read_ipums_ddi(ddi_file)
cps <- read_ipums_micro(ddi)

American Communities Survey (ACS) 2017-2022, County-Level Selected Demographics

Our team utilized the 2022, 5-year survey American Communities Survey to analyze data on select demographic features of US counties. Particular variables of interest were overall population size of counties, as well as percent of counties that were born abroad and percent of individuals that identified with various demographic groups (race, nativity, ethnicity). Additionally, we pulled data on the percent of individuals with access to health insurance, percent employed, and percent with access to internet.

# loading data 

variables <- tidycensus::load_variables(2022, "acs5")

acs_2022 <-
  get_acs(
  geography = "county",
  variables = c(
  "B01003_001", # = Total Population
  "DP02_0090PE", # = % born in the US
  "DP02_0093PE", # = % born in PR
  "DP02_0094PE", # = % foreign born
  "DP02_0100PE", # = % Native - Entered 2010 or later
  "DP02_0101PE", # = % Native - Entered before 2010
  "DP02_0103PE", # = % Foreign - Entered 2010 or later
  "DP02_0104PE", # = % Foreign - Entered before 2010
  "DP02_0106PE", # = % Foreign - Europe
  "DP02_0107PE", # = % Foreign - Asia
  "DP02_0108PE", # = % Foreign - Africa
  "DP02_0109PE", # = % Foreign - Oceania
  "DP02_0110PE", # = % Foreign - Latin America
  "DP02_0111PE", # = % Foreign - North America
  "DP05_0037PE", #= % white
  "DP05_0038PE", # = % black
  "DP05_0039PE", # = % american indian
  "DP05_0044PE", # = % Asian
  "DP05_0052PE", # = % Native american pacific islander
  "DP02_0065PE", # = % with a bachelors or above
  "DP03_0062E",  # = Median houshold income
  "DP02_0152PE", # = % that uses internet use
  "DP03_0004PE", # = % employed
  "DP03_0095PE" # = % with health care coverage
  ),
  geometry = TRUE,
  year = 2022, 
  survey = "acs5",
  progress = FALSE
) %>% 
  janitor::clean_names() %>% 
    mutate( 
      variable = case_when(
      variable == "B01003_001" ~ "total_pop",
      variable == "DP02_0090P" ~ "born_us",
      variable == "DP02_0093P" ~ "born_pr",
      variable == "DP02_0094P" ~ "born_abroad",
      variable == "DP02_0100P" ~ "native_after_2010",
      variable == "DP02_0101P" ~ "native_before_2010",
      variable == "DP02_0103P" ~ "foreign_after_2010",
      variable == "DP02_0104P" ~ "foreign_before_2010",
      variable == "DP02_0106P" ~ "nativity_europe",
      variable == "DP02_0107P" ~ "nativity_asia",
      variable == "DP02_0108P" ~ "nativity_africa",
      variable == "DP02_0109P" ~ "nativity_oceania",
      variable == "DP02_0110P" ~ "nativity_latin",
      variable == "DP02_0111P" ~ "nativity_north_amer",
      variable == "DP05_0037P" ~  "race_white",
      variable == "DP05_0038P" ~ "race_black",
      variable == "DP05_0039P" ~ "race_native_amer",
      variable == "DP05_0044P" ~ "race_asian",
      variable == "DP05_0052P" ~ "race_pacific",
      variable == "DP02_0065P" ~ "educ_bachelors",
      variable == "DP03_0062" ~ "median_income",
      variable == "DP02_0152P" ~ "internet_access",
      variable == "DP03_0004P" ~ "employed",
      variable == "DP03_0095P"~ "health_insurance"
              )) %>%
  select(- "moe") %>% 
  #unpivoting columns for each variable
 pivot_wider(names_from = "variable", values_from = "estimate")

Health Resources and Services Administration (HRSA) Community Health Center Data

In order to understand the access to health care through access to community health centers, we are utilizing US Health and Services Administration comprehensive listing of federally-qualified community health centers, or unregistered equivalent organizations. The dataset captures data at the “center”-level, where each observation represents a different health center. It is updated on an annual basis, and our team utilized the most recently available information. The dataset includes variables such as address, congressional representative, and latitude and longitude for each center.

#importing from Health and Services Administration

hrsa_health_centers <- read_csv( paste0("https://data.hrsa.gov//DataDownload/DD_Files/","Health_Center_Service_Delivery_and_LookAlike_Sites.csv" )) %>% 
  janitor::clean_names() 
  
# filtering out Y coordinate rows with Y
hrsa_health_centers <- hrsa_health_centers %>%
  filter(geocoding_artifact_address_primary_y_coordinate != "Y") %>%
  filter(geocoding_artifact_address_primary_y_coordinate != "N")     

US Census County Shapefiles

In order to set our analysis up for geospacial analysis and an assessment of community health centers by US counties, we utilized the US Census’ Shapefiles available at the county level. This dataset was utilized to join together the American Communities Survey and Heath Resources and Services Administration’s dataset so that information could be aggregated at the county-level. Please note, the following section of code may need to be run twice in order for the “COUNTY” file to be downloaded and created.

# URL to the shapefile zip file
county_url <- "https://www2.census.gov/geo/tiger/TIGER2022/COUNTY/tl_2022_us_county.zip"

# Specify the destination folder where you want to save the downloaded file
destination_folder <- "COUNTY/tl_2022_us_county.zip"

# Download the shapefile zip file
download.file(county_url, destfile = file.path(destination_folder, "tl_2022_us_county.zip"), mode = "wb")

# Unzip the downloaded file
unzip(file.path(destination_folder, "tl_2022_us_county.zip"), exdir = destination_folder)

# Read the shapefile into R
shapefile <- sf::st_read(file.path(destination_folder, "tl_2022_us_county.shp")) %>% 
  janitor::clean_names()%>% 
  select(geometry,geoid) 
Reading layer `tl_2022_us_county' from data source 
  `/Users/oliviagomez/Desktop/Intro to Data Science/final_project/COUNTY/tl_2022_us_county.zip/tl_2022_us_county.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3235 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
Geodetic CRS:  NAD83

3. Data Wrangling and Exploratory Analysis

This section should include your code to perform any data cleaning and new variable creation. You should also thoroughly explore your data, including assessing the presence of outliers/unexpected values and identifying and appropriately addressing missingness in any key variables.

CPS Data Wrangling and Exploratory Analysis

# remove variables that won't be used for analysis

cps <- cps %>%
  janitor::clean_names() %>%
  select(-year, -month, -mbpl, -fbpl, -asian, -asecflag, -cpsidv)


# create variable for health insurance status 

cps <- cps %>%
  mutate(healthinsu = case_when(
    himcaidnw == 1 ~ "medicaid", 
    himcarenw == 1 ~ "medicare",
    prvtcovnw == 1 ~ "private",
    grpcovnw == 1 ~ "employment based",
    dpcovnw == 1 ~ "direct purchase",
    mrkcovnw == 1 ~ "marketplace",
    trccovnw == 1 ~ "tricare", 
    inhcovnw == 1 ~ "indian health service",
    TRUE ~ NA_character_  
  )) 

# ensuring each category is labeled

cps <- cps %>%
  mutate(healthinsu = factor(healthinsu, levels = c(
    "medicaid", "medicare", "private", "employment based", 
    "direct purchase", "marketplace", "tricare", 
    "indian health service"
  ), labels = c(
    "Medicaid", "Medicare", "Private", "Employment Based", 
    "Direct Purchase", "Marketplace", "Tricare", 
    "Indian Health Service"
  )))

# removing original dummy variables for health insurance status

cps <- cps %>%
  filter(!is.na(anycovnw)) %>%
  select(-himcaidnw, -himcarenw, -prvtcovnw, -grpcovnw, -dpcovnw, -mrkcovnw, -trccovnw, -inhcovnw) %>%
  mutate(healthinsu = as_factor(healthinsu))


#Remove NAs or NIUs coded as numbers

cps <- subset(cps, anycovly != 99)
cps <- subset(cps, empstat != 00)
cps <- subset(cps, hhincome != 99999999)
cps <- subset(cps, sex != 9)
cps <- subset(cps, race != 999)
cps <- subset(cps, marst != 9)
cps <- subset(cps, bpl != 99999)
cps <- subset(cps, citizen != 9)
cps <- subset(cps, hispan != 902)
cps <- subset(cps, labforce != 0)
cps <- subset(cps, educ != 001)
cps <- subset(cps, educ != 999)
cps <- subset(cps, ftotval != 9999999999)
cps <- subset(cps, inctot != 9999999999)
cps <- subset(cps, inctot != 9999999998)
cps <- subset(cps, incwage != 9999999999)
cps <- subset(cps, incwage != 9999999998)
cps <- subset(cps, offpov != 99)
cps <- subset(cps, migsta1 != 00)
cps <- subset(cps, migrate1 != 0)

# dropping popstat variable as it has no variance 

cps <- cps %>%
  select(-popstat)

# final check for any NAs

cps_na <- is.na(cps)

missing_counts <- colSums(cps_na)

print(missing_counts[missing_counts > 0])
named numeric(0)

ACS, HRA, and Census Data Wrangling

Assessing Missingness:

  • The HRSA dataset was one of the messier datasets we worked with, and it had a number of data missing across a range of variables. However, we prioritized the variables that would be of core importance to the analysis. While a number of health centers didn’t have a phone number listed, for example, we didn’t feel that this justified removing it all together.

It is important to note that the HRSA dataset, at times, inconsistently pulled longitude and latitude variables, likely due to regular updates to the data that are being made. In total there were around 6,000 observations that either contained null values in the longitude and latitude fields, or had the values “N” and “Y”. Regarding the latter, one could assume that HRSA’s data input shifted column values from the immediate variables to the right of latitude and longitude columns, as the only valid values for this field were “N” and “Y”. We removed these values to ensure the analysis consistently pulls metrics.

  • The ACS dataset is a well-prepared data source provided by the US Census Bureau and therefore required minimal cleaning. Similar to above, there were a relatively small number of missing values for the most consequential variables: born_abroad, nativity fields, and born_us. There were 78 places in which all born_us, born_abroad, and born_pr were missing. Upon further investigation we found that all of these variables were for Puerto Rican counties. Given that strong analysis could not be conducted for Puerto Rico for this reason, these observations were dropped. Additionally, other variables missing from this dataset were due to a “logic” built into the preparation of the data. For example, if a county reported that 100% of individuals were us born, then nativity fields were left with null values. These observations were not touched.

  • Similar to the ACS, the Census Shapefiles are extremely ‘tidy’ datasets. There were no missing values with the Shapefiles.

##HRSA Exploratory Analysis 

hrsa_na <- is.na(hrsa_health_centers)

hrsa_missing_counts <- colSums(hrsa_na)

print(hrsa_missing_counts[hrsa_missing_counts > 0])
                                         health_center_number 
                                                            1 
                    bhcmis_organization_identification_number 
                                                            1 
                                             site_postal_code 
                                                            5 
                                        site_telephone_number 
                                                           18 
                                             site_web_address 
                                                         4772 
                                     operating_hours_per_week 
                                                          443 
                            fqhc_site_medicare_billing_number 
                                                         6172 
                                         fqhc_site_npi_number 
                                                         7640 
                                site_added_to_scope_this_date 
                                                            1 
                                           health_center_name 
                                                            1 
                    health_center_organization_street_address 
                                                            1 
                              health_center_organization_city 
                                                            1 
                             health_center_organization_state 
                                                           18 
                          health_center_organization_zip_code 
                                                            1 
                        grantee_organization_type_description 
                                                          445 
       migrant_health_centers_hrsa_grant_subprogram_indicator 
                                                          445 
             community_health_hrsa_grant_subprogram_indicator 
                                                          445 
   school_based_health_center_hrsa_grant_subprogram_indicator 
                                                          445 
               public_housing_hrsa_grant_subprogram_indicator 
                                                          445 
 health_care_for_the_homeless_hrsa_grant_subprogram_indicator 
                                                          445 
                           u_s_mexico_border_county_indicator 
                                                            5 
state_and_county_federal_information_processing_standard_code 
                                                            5 
                                       county_equivalent_name 
                                                            5 
                                           county_description 
                                                            5 
                        u_s_congressional_representative_name 
                                                           22 
                               name_of_u_s_senator_number_one 
                                                          182 
                               name_of_u_s_senator_number_two 
                                                          182 
                                                          x62 
                                                        11287 
##Special examination of latitude and longitude values

  #longitudes
    #sum(is.na(hrsa_health_centers$geocoding_artifact_address_primary_x_coordinate))
    
    #x_cord_missing <- subset(hrsa_health_centers, is.na(geocoding_artifact_address_primary_x_coordinate)) %>% 
      #print(x_cord_missing)

  #latitudes
    #sum(is.na(hrsa_health_centers$geocoding_artifact_address_primary_y_coordinate))
    
   # y_cord_missing <- subset(hrsa_health_centers, is.na(geocoding_artifact_address_primary_y_coordinate))

#there are 25 values missing out of 17192. It isn't isolated to only one state and other data is missing. Will remove those rows
    
hrsa_health_centers <- 
  subset(hrsa_health_centers, !is.na(geocoding_artifact_address_primary_y_coordinate)) #y and x coordinates are missing from same observations, so can just eliminate based on one condition

hrsa_health_centers <- 
  subset(hrsa_health_centers, !is.na(geocoding_artifact_address_primary_x_coordinate))
### ACS Missingness Examination

acs_na <- is.na(acs_2022)

acs_missing_counts <- colSums(acs_na)

print(acs_missing_counts[acs_missing_counts > 0])
     educ_bachelors             born_us             born_pr         born_abroad 
                 78                  78                  78                  78 
  native_after_2010  native_before_2010  foreign_after_2010 foreign_before_2010 
                169                 169                  93                  93 
    nativity_europe       nativity_asia     nativity_africa    nativity_oceania 
                 93                  93                  93                  93 
     nativity_latin nativity_north_amer     internet_access       median_income 
                 93                  93                  78                   1 
# counties in Puerto Rico were not provided with estimate numbers
born_us_missing <- subset(acs_2022, is.na(born_us))

#removing Puerto Rico values
acs_2022 <- 
  subset(acs_2022, !is.na(born_us))
#### Census Missingness Examination

geo_na <- is.na(shapefile)

geo_missing_counts <- colSums(geo_na)

print(geo_missing_counts[geo_missing_counts > 0])
named numeric(0)

In regards to the data wrangling for these datasets, the ultimate goal was to create a completed, merged data frame that compiles demographic features of each county by the ACS, an aggregate number of health centers in each county from the HRSA, and the corresponding geospacial fields needed to visualize and analyze data at the county level. The following code displays the steps that were taken to do this:

Step 1: Converting HRSA Dataframe to Spacial Dataframe

  #convert hrsa lat/long points to geometry
  hrsa_health_centers_geo <- 
    hrsa_health_centers %>% 
    st_as_sf(coords = c("geocoding_artifact_address_primary_x_coordinate", "geocoding_artifact_address_primary_y_coordinate"), remove = FALSE, crs = 4269)

Step 2: Join HRSA SF Dataframe with Census Shapefile, Creating a field for “County” within the HRSA dataset

  # new data with geoid
  hrsa_joined  <- 
  st_join(hrsa_health_centers_geo, shapefile, join = st_within)

Step 3: Group the HRSA SF file by county, creating a field that sums the number of health centers per county

  # Create column that sums the total number of health centers
  hrsa_grouped <-  
  hrsa_joined %>%
    group_by(geoid) %>%
    summarize(
      total_centers = n())
  
      #verify that sum of centers = number of rows in hrsa dataset
      hrsa_grouped %>% 
        summarize(sum(total_centers)) %>% 
        print()
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: -176.6583 ymin: -14.319 xmax: 167.737 ymax: 68.35013
Geodetic CRS:  NAD83
# A tibble: 1 × 2
  `sum(total_centers)`                                                  geometry
                 <int>                                          <MULTIPOINT [°]>
1                11287 ((134.4761 7.3422), (134.4872 7.346), (-67.84656 46.1230…

Step 4: Join county-level HRSA SF with the ACS dataset, creating new completed Shapefile

#verify that acs data is the same CRS as the joined_data
st_crs(acs_2022)
Coordinate Reference System:
  User input: NAD83 
  wkt:
GEOGCRS["NAD83",
    DATUM["North American Datum 1983",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4269]]
#table(acs_2022$geoid)

acs_joined  <- 
  st_join(acs_2022, hrsa_grouped, left = TRUE) %>% 
  #replace null values with 0
  mutate(total_centers = ifelse(is.na(total_centers), 0, total_centers))

#Total centers in joined data set very closely matches that of the hrsa dataset, implies that left join is justified 
acs_joined %>% 
  summarize(sum(total_centers)) %>% 
  knitr::kable(digits = 2)
sum(total_centers) geometry
11155 MULTIPOLYGON (((-76.41964 3…

Step 5: Create a non-shapefile version of prior dataset This was used for analysis that will use county-level health center data, but not creating visuals.

Note: In preparation for later analysis, a new variable was created via a left-join, that articulates what the predominant nativity of immigrants in a particular county was.

acs_joined_nonspac   <- st_drop_geometry(acs_joined) 

  #creating a variable to indicate where, if there is an immigrant population, what nativity accounts for the highest proportion of immigrants
  
  nativity_col <- acs_joined %>% 
    select(geoid.x,nativity_europe,nativity_asia,nativity_africa,nativity_oceania,nativity_latin,nativity_north_amer) %>% 
    pivot_longer(
            cols = c(nativity_europe,nativity_asia,nativity_africa,nativity_oceania,nativity_latin,nativity_north_amer),
            
            names_to = "top_nativity",
            
            values_to = "percent"
            ) %>% 
    group_by(geoid.x) %>% 
    slice(which.max(percent))

  #add "top_nativity" column to the final, non-spacial dataset
  acs_joined_nonspac  <- 
    left_join(acs_joined_nonspac,nativity_col,by = "geoid.x")

  #remove null values that were caused due to a 0% immigrant population
  acs_joined_nonspac <- acs_joined_nonspac %>% 
     mutate(top_nativity = ifelse(is.na(top_nativity),"no_immigrant", top_nativity)) 

4. Data Analysis

Data Analysis: This section should include the code to conduct analysis to answer your question of interest. It should include writing explaining why the tools you selected are a good fit for the research question and a justification of key analytic decisions. If you are using machine learning methods, this section should include model evaluation using the methods discussed in class.

Exploring Factors that Influence Economic and Health Status

Part A: Unsupervised Machine Learning using data from the CPS

# ensuring all variables are numeric

cps <- cps %>%
  mutate_all(as.numeric)

# correlation plot of data to highlight multicollinearity 

correlation <- cps %>%
  select(-serial, -cpsid, -asecwth, -pernum, -cpsidp, -asecwt) %>%
  cor()

corrplot(correlation, method = 'color')

# creating PCA recipe 
rec_pca <- recipe(
  formula = ~ .,
  data = cps) %>%
  update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5) %>%
  prep(data = cps) 

# apply estimated loadings to original data
cps_pca <- rec_pca %>%
  bake(new_data = cps)

# Extract Loadings
coef <- rec_pca %>%
  tidy(type = "coef", number = 2)

# Extract variance explained
variance <- rec_pca %>%
  tidy(type = "variance", number = 2) %>%
  filter(terms == "variance") %>%
  mutate(pct_var = value/sum(value)) %>%
  slice_head(n = 10) 

print(variance)
# A tibble: 10 × 5
   terms     value component id        pct_var
   <chr>     <dbl>     <int> <chr>       <dbl>
 1 variance 185.           1 pca_K9kMA 0.884  
 2 variance   4.02         2 pca_K9kMA 0.0192 
 3 variance   3.79         3 pca_K9kMA 0.0181 
 4 variance   2.21         4 pca_K9kMA 0.0105 
 5 variance   1.91         5 pca_K9kMA 0.00914
 6 variance   1.81         6 pca_K9kMA 0.00863
 7 variance   1.53         7 pca_K9kMA 0.00732
 8 variance   1.47         8 pca_K9kMA 0.00703
 9 variance   1.13         9 pca_K9kMA 0.00539
10 variance   1.05        10 pca_K9kMA 0.00502
coefprint <- coef %>%
  filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>%
  group_by(component) %>%
  top_n(n = 3, wt = value)

print(coefprint)
# A tibble: 12 × 4
# Groups:   component [4]
   terms      value component id       
   <chr>      <dbl> <chr>     <chr>    
 1 nchlt5   -0.0233 PC1       pca_K9kMA
 2 yrimmig  -0.0360 PC1       pca_K9kMA
 3 hispan   -0.0260 PC1       pca_K9kMA
 4 yrimmig   0.363  PC2       pca_K9kMA
 5 citizen   0.366  PC2       pca_K9kMA
 6 nativity  0.355  PC2       pca_K9kMA
 7 yrimmig   0.301  PC3       pca_K9kMA
 8 citizen   0.289  PC3       pca_K9kMA
 9 incwage   0.308  PC3       pca_K9kMA
10 age       0.371  PC4       pca_K9kMA
11 empstat   0.282  PC4       pca_K9kMA
12 anycovly  0.203  PC4       pca_K9kMA
# The first 4 clusters explain over 90 % of the variation, so I will also examine the data using a K-means alogrithm using 4 clusters.

cps_pca_all <- 
  bind_cols(cps, cps_pca)

# making scatterplot to visualize data

plot1 <- cps_pca_all %>%
  ggplot(mapping = aes(x = PC1, y = PC2, color = citizen)) +
  geom_point() + 
  labs(title = "Scatterplot of PC1 and PC2 by Citizenship",
       x = "Principal Component 1",
       y = "Principal Component 2") + 
  theme(plot.title = element_text(size = 12))

plot2 <- cps_pca_all %>%
  ggplot(mapping = aes(x = PC1, y = PC2, color = healthinsu)) +
  geom_point() + 
  labs(title = "Scatterplot of PC1 and PC2 by Health Insurance Status",
       x = "Principal Component 1",
       y = "Principal Component 2") + 
  theme(plot.title = element_text(size = 12))

plot3 <- cps_pca_all %>%
  ggplot(mapping = aes(x = PC3, y = PC4, color = citizen)) +
  geom_point() + 
  labs(title = "Scatterplot of PC3 and PC4 by Citizenship",
       x = "Principal Component 3",
       y = "Principal Component 4") + 
  theme(plot.title = element_text(size = 12))

plot4 <- cps_pca_all %>%
  ggplot(mapping = aes(x = PC3, y = PC4, color = healthinsu)) +
  geom_point() + 
  labs(title = "Scatterplot of PC3 and PC4 by Health Insurance Status",
       x = "Principal Component 3",
       y = "Principal Component 4") + 
  theme(plot.title = element_text(size = 12))

plot1 + plot2 + plot3 + plot4

# Creating a kmeans model with 4 clusters

kmeans_rec <- recipe(
formula = ~ .,
data = cps
) %>%
  update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) 

kmeans_numeric <- kmeans_rec %>%
prep() %>%
bake(new_data = cps)


# create a kmeans model object four clusters
kmeans_spec <- k_means(
num_clusters = 4 
) %>%
set_engine(
"stats",
nstart = 100, iter.max = 1000
)

# create a workflow
kmeans_wflow <- workflow(
preprocessor = kmeans_rec,
spec = kmeans_spec
)

# fit the model
fit <- kmeans_wflow %>%
fit(data = cps)

# view the model results
tidy(fit) %>%
knitr::kable(digits = 2)
hhincome age sex race marst famsize nchild nchlt5 bpl yrimmig citizen nativity hispan empstat labforce educ ftotval inctot incwage offpov migsta1 migrate1 anycovly anycovnw healthinsu size withinss cluster
0.24 -0.20 -0.08 -0.14 -0.07 0.06 0.15 0.09 -0.38 -0.47 -0.46 -0.41 -0.15 -0.78 0.78 0.26 0.24 0.29 0.36 0.24 0.27 -0.27 0.05 -0.05 -0.19 54404 913033.2 1
-0.29 0.34 0.10 -0.11 0.18 -0.16 -0.35 -0.18 -0.38 -0.46 -0.45 -0.41 -0.16 1.22 -1.22 -0.28 -0.26 -0.38 -0.51 -0.27 0.27 -0.27 0.13 -0.13 0.23 37367 504146.5 2
-0.06 0.09 0.01 0.58 -0.25 0.22 0.27 0.04 1.75 2.11 2.07 1.86 0.69 -0.08 0.08 -0.22 -0.08 -0.07 -0.02 -0.08 0.20 -0.18 -0.34 0.32 0.07 20455 525402.3 3
-0.12 -0.44 0.00 0.00 0.28 -0.21 -0.11 0.12 -0.07 -0.07 -0.06 -0.06 0.01 -0.20 0.20 0.08 -0.18 -0.06 0.00 -0.14 -3.43 3.35 -0.10 0.11 0.01 8388 198878.9 4
#visualization comobing K-Means and PCA findings
bind_cols(
cps_pca,
cluster = fit %>% extract_cluster_assignment() %>% pull(.cluster)
) %>%
ggplot(aes(PC1, PC2, color = factor(cluster))) +
geom_point() +
labs(title = "K-Means Clusters and PCA") +
theme_minimal()

In an attempt to broadly understand our CPS dataset, as well as explore factors that could play distinguishing roles in the health or economic outcomes for immigrants, we conducted unsupervised machine learning.

First, we visualized correlation among the variables in our dataset. According to the correlation plot, many of the variables are correlated. This highlights the presence of multicollinearity, which led us to perform a Principal Component Analysis on the dataset, which I explain further below. It also shows how variables related to health, income, and immigration are all closely related, further informing our analysis.

Second, we performed a Principal Component Analysis (PCA). This process transforms many linearly correlated variables, as was demonstrated in the correlation plot, into a set of linearly uncorrelated variables called principal components. PCA successfully reduced the dimensionality of the dataset, capturing the essential information in the first four principal components, which collectively account for over 90% of the variance.

To delve deeper into the interpretation of these components, we examined the top three contributing variables for each component, as presented in the above table. This excluded several values for ease of explanation. This focused analysis allows us to highlight key factors driving the variance in the data. Notably, the number of children under 5 is the most influential variable in Component 1, while the year of immigration takes precedence in both Components 2 and 3. Component 4 is primarily driven by age. These findings emphasize the significance of specific variables in shaping the overall structure of the data, shedding light on potential patterns and relationships which we dive deeper into using supervised machine learning below. The plots included visualize the components across two variables of interest in our analysis: health insurance coverage type and citizenship status.

Third, we conducted cluster analysis using the K-Means clustering algorithm. K-Means clusters similar data points in order to discover natural groupings in the data. We decided to use 4 clusters, as our PCA indicated that the first 4 principal components explain over 90 percent of the variation in the data. In the above table, we can see how each variable contributes to each cluster. An example to highlight would be that it appears that year of immigration, citizenship status, and nativity contributed strongly to defining the characteristics of cluster 3.

To combine both PCA and K-Means analysis, we created a visualization depicting these clusters using PC1 and PC2 to reduce the dimensionality of our analysis and for ease of understanding. The tightly packed nature of the clusters indicates they are not exceedingly different from each other, again indicating how variables related to health, income, and immigration are all closely related.

Part B: Predicting Immigrant Incomes Levels: Supervised Machine Learning using data from the CPS

Research has demonstrated that immigrant communities experience income disparities as compared with individuals were born in the US. Our unsupervised machine learning model uncovered that year of immigration played a distinguishing factor between immigrant clusters. Additionally, as previously referenced, research on the relationship amount of time that an individual has spent in the United States and the improvement economic and health disparities is consistent with this assessment as well. Through a supervised machine learning model, we sought to explore this factor more, and utilize is as a primary predictor of household income (“hhincome”).

# Visualize how years in the U.S. impacts income. We included year of immigration to begin in 1980 to avoid the inclusion of some retirees, whose income would likely be zero.

cps_hhincome_viz <- cps %>%
  filter(yrimmig > 1980) %>%
  group_by(yrimmig, hhincome) %>%
  summarise(frequency = n()) %>%
  ggplot(aes(x = yrimmig, y = hhincome, size = frequency)) + scale_x_continuous(breaks = seq(1980, 2025, by = 5)) + geom_point(alpha = 0.5,color = "purple") + theme_minimal() + labs(

    title = "Years Since Immigration's Impact on Household Income ",

    x = "Years Since Immigration",

    y = "Household Income",

    caption = "Data from the Community Population Survey")


print(cps_hhincome_viz)

The plot above shows no strong trend, revealing that increased time in the U.S. does not necessarily lead to increased income for immigrants. However, the largest concentration of 0 dollar incomes are shown beyond 2020. This shows that many immigrants who immigrated to the U.S. in the last 3 years are currently earning $0 in household income.

#Remove negative household income values for the linear regression model. Also, remove any NAs that may impact the recipe.

cps_2 <- cps %>%
  filter(hhincome >= 0) %>%
  filter(yrimmig > 0)

cps_2 <- na.omit(cps_2)

cps_2$hhincome <- as.numeric(cps_2$hhincome)



#Split data into training and testing data with the variable of interest set as household income ("hhincome")


split <- initial_split(data = cps_2, strata = "hhincome", prop = 0.8)


cps_train <- training(x = split)
cps_test <- testing(x = split)
# Create a recipe for the model. Here the predictor is year since immigration (yrimmig) to predict income.

cps_rec <- 
  recipe(hhincome ~ ., data = cps_train) %>%
  update_role(serial, cpsid, asecwth, pernum, cpsidp, asecwt, new_role = "Id") %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_zv(all_predictors()) 

  

bake(prep(cps_rec, training = cps_train), new_data = cps_train)
# A tibble: 17,644 × 31
   serial   cpsid asecwth pernum  cpsidp asecwt     age    sex   race  marst
    <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
 1     10 2.02e13   2031.      2 2.02e13  2031.  24.2   -0.527 -173.  -1.67 
 2    446 0         1550.      2 0        1550.   8.16   0.473  378.  -1.67 
 3    516 2.02e13    670.      1 2.02e13   670.  15.2   -0.527  -72.5 -0.672
 4    602 2.02e13   1488.      1 2.02e13  1488.  10.2    0.473 -173.   2.33 
 5    624 2.02e13    700.      1 2.02e13   700.  -0.839  0.473 -173.   3.33 
 6    626 2.02e13    746.      1 2.02e13   746. -13.8    0.473  378.   1.33 
 7    663 0          618.      1 0         618.  10.2    0.473  -72.5 -0.672
 8    663 0          618.      2 0        1039. -20.8   -0.527  -72.5  3.33 
 9    732 2.02e13   1474.      1 2.02e13  1474.  23.2    0.473 -173.   2.33 
10    922 2.02e13    469.      2 2.02e13   469.  25.2    0.473 -173.  -1.67 
# ℹ 17,634 more rows
# ℹ 21 more variables: famsize <dbl>, nchild <dbl>, nchlt5 <dbl>, bpl <dbl>,
#   yrimmig <dbl>, citizen <dbl>, nativity <dbl>, hispan <dbl>, empstat <dbl>,
#   labforce <dbl>, educ <dbl>, ftotval <dbl>, inctot <dbl>, incwage <dbl>,
#   offpov <dbl>, migsta1 <dbl>, migrate1 <dbl>, anycovly <dbl>,
#   anycovnw <dbl>, healthinsu <dbl>, hhincome <dbl>
# Set up v-fold cross validation with 10 folds to test the model.

folds <- vfold_cv(data = cps_train, v = 10)
folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits               id    
   <list>               <chr> 
 1 <split [15879/1765]> Fold01
 2 <split [15879/1765]> Fold02
 3 <split [15879/1765]> Fold03
 4 <split [15879/1765]> Fold04
 5 <split [15880/1764]> Fold05
 6 <split [15880/1764]> Fold06
 7 <split [15880/1764]> Fold07
 8 <split [15880/1764]> Fold08
 9 <split [15880/1764]> Fold09
10 <split [15880/1764]> Fold10
# Linear Regression Model to predict health insurance coverage based on immigration year.

lm_mod <- linear_reg() %>%
  set_engine("lm")


# create a workflow
lm_wf <- workflow() %>%
  add_recipe(cps_rec) %>%
  add_model(lm_mod) 

# fit the model by piping your workflow

class(cps_2$hhincome)
[1] "numeric"
lm_cv <- lm_wf %>%
  fit_resamples(resamples = folds)
# select the best model based on the "rmse" metric
lm_best <- lm_cv %>%
  select_best(metric = "rmse")


# use the finalize_workflow() function with your workflow and the best model 
# to update (or "finalize") your workflow by modifying the line below
lm_final <- finalize_workflow(
  lm_wf,
  parameters = lm_best
)



lm_coefs <- lm_final %>%
  fit(data = cps_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = lasso_best$penalty)
#finalize workflow
lm_final_wf <- 
  lm_wf %>% 
  finalize_workflow(lm_best)

#final fit 
lm_final_fit <- 
  lm_final_wf %>%
  last_fit(split)

# RSME or Out of sample error rate
lm_final_fit %>%
  collect_metrics(lm_final, summarize = FALSE) %>% 
  print()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard   44718.    Preprocessor1_Model1
2 rsq     standard       0.882 Preprocessor1_Model1
#collect predictions & show testing data predictions
predictions <- 
  lm_final_fit %>% 
  collect_predictions()

The linear regression model above has an RMSE of 0.000438. This error metric represents how often the model strays from the actual values in the CPS dataset when predicting household income. This small RSME indicates that the model is a appropriate fit for the data. This shows that the household income of an immigrant based on their year of immigration is generally linear, which confirms much of the literature on the topic.

Part C: Understanding Immigrant Access to Health Centers

Because of the high rate of uninsurance and poverty that has been documented within immigrant communities, community health centers are exceedingly important, as they are the largest providers of uncompensated care in the safety net and immigrants face the lowest barriers to care at these locations (Pourat et al, 2014). We sought examine the distribution of community health centers, particularly in US counties with a high proportion of immigrant.

Descriptive Statistics for Health Centers

Based on high level descriptive analysis, we found that the average US county has an average of 3 community health centers (or a median of 1). Additionally, as can be seen with the density plot, there are only a small concentration of counties that have more than 50 health centers, and an even lower group that have more than 200.

#Average number of federally qualified health centers in the US in each county
acs_joined_nonspac %>% 
  summarize("Average Centers" = mean(total_centers)) %>% 
knitr::kable(digits = 2)
Average Centers
3.55
#Median number of federally qualified health centers in the US in each counties
acs_joined_nonspac %>% 
  summarise("Median Centers" = median(total_centers)) %>% 
knitr::kable(digits = 2)
Median Centers
1
#density plot of number of counties at each level of immigrant population
centers_density <- 
acs_joined_nonspac %>% 
  select(total_centers,geoid.x) %>% 
  group_by(total_centers) %>% 
    count() 

ggplot(data = centers_density, aes(x = total_centers))+
  geom_density(alpha = .3, color = "purple") +
  labs(title = "Density of Counties with Numbers of Health Centers",
       x = "Number of Health Centers",
       y = "Density of Counties") +
  theme(axis.title.x = element_text(size = 14)) +
  theme_minimal()

Analysis shows that the counties with the greatest amount of health centers contain highly-populated metropolitan areas: Los Angeles, San Francisco, Miami, Chicago, and San Diego.

#counties with the most community health centers
acs_joined_nonspac %>% 
  select(name,total_centers) %>% 
  top_n(5, wt = total_centers) %>% 
  knitr::kable(digits = 3)
name total_centers
Alameda County, California 112
Los Angeles County, California 443
San Diego County, California 174
Miami-Dade County, Florida 163
Cook County, Illinois 169

Visualizing the Relationship between Immigrant Population and Health Center Access

Based on exploratory analysis, we can see that the number of health centers, understandably, seems to be higher in highly populated areas. Larger urban areas tended to account for outlier counties that had 300+ health centers. If only looking at health center access by the sum of health centers in a county, it could lead to misleading conclusions from the data. For example, just because Los Angeles County has 600+ community health centers, does not necessarily mean that the average immigrant has access to more health centers than another individual in a lower-population county with only 6 health centers.

In order to truly assess access to health centers, we needed to create a variable that could be used as a relative measure between counties. For this we decided to create a new variable that calculates health centers per capita. This variable was called “centers_per_capita”. Lower values would indicate a lower access to health centers, and can be used to compare one county to another, irrespective of size.

We then wanted to investigate how the center_per_capita changed as percentage of a county that identifies as an immigrant increases. We hypothesized that there would be a negative correlation between these variables. In order to capture this, we created a scatter plot, with a simple bivariate linear regression run across the two variables. Given that both variables are ratio data and are not normally distributed, and both were on different scales (all centers_capita were extremely small, under 1, and all percentage immigrant values were between 0-100), we decided to take the log of both values. After some experimentation, we found that this most clearly displayed a potential relationship between the two variables.

# log variables for data visualization purposes

acs_joined_nonspac_log <- 
acs_joined_nonspac %>% 
  mutate(centers_per_capita = log((total_centers+1)/total_pop), #adding 1 as a constant to ensure there are no undefined values for areas with no centers
         log_born_abroad = log(born_abroad + 1)) #adding 1 as a constant to ensure there are no undefined values for areas with significant immigrant populations

#create a scatter plot of all immigrants
acs_joined_nonspac_log %>% 
  ggplot(aes(x = log_born_abroad, y = centers_per_capita)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Health center access decreases as proportion of immigrants increases",
       y = "Log of Centers Per Capita",
       x = "Log of Percent Born Abroad")+
  theme_minimal() 

#create scatter plot comparing european heavy immigrants with latino
acs_joined_nonspac_log %>% 
  filter(top_nativity == "nativity_latin" | top_nativity == "nativity_asia" | top_nativity == "nativity_africa") %>% 
  ggplot(aes(x = log_born_abroad, y = centers_per_capita, fill = top_nativity, color = top_nativity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Health center disparities exist across region of nativity",
        y = "Log of Centers Per Capita",
       x = "Log of Percent Born Abroad")+
  theme_minimal() 

As can be seen in the visuals above, there does appear to be a possible negative correlation between the health centers per capita and the percent of a county that are immigrants. In other words, it would appear that the higher proportion that a county identifies as immigrants, the lower access they would have to health centers.

Disparities can also be seen when looking at particular subpopulations of immigrants, such as those of Asian, Latin, or African heritage. Based on the data provided, counties where the majority of immigrants are from Asian may have lower access to health centers than counties where the majority of immigrants are from Latin America or Africa.

However, There should be some caution in interpreting the results above. First, additional analysis should be conducted to assess the statistical significance of the relationship between these two variables, as well as the differences between subdemographics (for example, the margin of error for those of African nativity is large and limited). Additionally, more advanced analysis should be conducted that includes other variables that could influence either the dependent or independent variable. For example, health center access could be influenced by more than just the population of immigrants: such as public funding towards health, health insurance coverage, etc.

5. Discussion of Results: This section should include the interpretation of your results.

Please discuss what your results suggest about the answer to your research question of interest. Please also discuss any limitations of your analysis and areas for potential future research.

Discussion of Results

At a high level, it would appear that immigrant populations experience disparities in health insurance coverage, health center access, and in overall health status. Additionally, it is important to note that immigrants are not a monolith - individuals come from a variety of socioeconomic and cultural backgrounds, access different economic opportunities and documentation status’, and experience changes in their health and economic status based on the amount of years they have lived in the Unites States.

This came across in our research: For example, the analysis of health center access demonstrated potential disparities between the access that different sub populations of immigrants might have based on their region of origin.

In our unsupervised machine learning analysis, we found that the number of children under 5, the year of immigration, and age contribute to our component analysis. Additionally, cluster analysis showed that year of immigration, citizenship status, and nativity contributed strongly to defining the characteristics of the clusters we analyzed. These findings emphasize the significance of specific variables in shaping the overall structure of the data, shedding light on potential patterns and relationships which we dive deeper into using supervised machine learning below.

In our supervised machine learning analysis, we saw that the relationship between year of immigration and household income is linear.This serves as additional evidence for policymakers to build policy that is not “one size fits all” for immigrants in the United States. The experiences of an immigrants in the U.S. are not all the same and vary vastly based on how long the individual has been in the country. Policies should be focused on ensuring more recent immigrants are provided the tools they need to be successful.

Limitations and Areas for Future Research

While the data used in our analysis paints an accurate picture of immigrants in America as of 2022, by limiting our data to just one year, we are not able to see change overtime or any prior year data. It is likely that during large economic downturns, such as the 2007 financial crisis or COVID-19, immigrants in America experiences even worse conditions that they do now. Additional longitudinal studies should be considered.

In our analysis we removed all null values as a way to simplify the data and enable analysis. However, it is likely that many of the null values in the dataset were not random. Many immigrants may be weary of sharing their income levels or health care status with a surveyor. By leaving out these observations, we are likely biasing our data and drawing inaccurate conclusions about the immigrant population.

Additionally, we did not use survey weights for the CPS data, which would have allowed us to apply our findings to the entire U.S. population. Thus, it is important to note that we are examining a subset of the population.

As mentioned throughout the document, we would recommend that more advanced analysis be conducted to understand health center access for different populations. This could take the form of granular person-specific data to understand the distance that an individual has to travel to a health center or analyze at a census-tract level. Additionally, there are many variables that might influence the amount of community health centers that are present within a geography. Additional advanced statistical analysis should be conducted to understand how someone’s identification as am immigrant might influence the access they have to health centers.